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Abstract 



We propose efficient algorithms for two key tasks in the analysis of large 

' ^ ' nonuniform networks: uniform node sampling and cluster detection. Our 

S ! sampling technique is based on augmenting a simple, but slowly mixing 

fj ' uniform MCMC sampler with a regular random walk in order to speed up 

its convergence; however the combined MCMC chain is then only sampled 
when it is in its "uniform sampling" mode. Our clustering algorithm de- 
^ [ termines the relevant neighbourhood of a given node u in the network by 

00 ' first estimating the Fiedler vector of a Dirichlet matrix with u fixed at zero 

tJ" I potential, and then finding the neighbourhood of u that yields a minimal 

^^ ' weighted Cheeger ratio, where the edge weights are determined by differ- 

JS^ I ences in the estimated node potentials. Both of our algorithms are based 

._!. ' on local computations, i.e. operations on the full adjacency matrix of the 

(^ I network are not used. The algorithms are evaluated experimentally using 

three types of nonuniform networks: Dorogovtsev-Goltsev-Mendes "pseud- 
2 ! ofractal graphs", scientific collaboration networks, and randomised "cave- 

man graphs". 



Q ■ 1 Introduction 



Two key tasks in the analysis of large natural networks, such as communication 
networks and social networks, are obtaining a uniform sample of nodes in the net- 
work, and determining the densely interconnected clusters of nodes. Uniform sam- 



P^ \ pling is important e.g. for the purpose of estimating basic network characteristics 
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such as the degree distribution, average path length, and clustering coefficient; it is, 
however, nontrivial to obtain a truly uniform random sample of nodes from a large, 
practically unobtainable network such as the WWW |fT3!|. In this paper, we suggest 
an efficient approach for uniform sampling of undirected nonuniform graphs, using 
a construction that combines two types of random walks to produce one that mixes 
rapidly and still converges to the uniform distribution over the set of nodes. 

We also discuss the problem of clustering nonuniform networks, i.e. the recog- 
nition of subgraphs where the nodes have relatively many edges among themselves 
and relatively few edges connecting them to the rest of the graph |13. For large 
nonuniform networks, an effective clustering algorithm should scale at most lin- 
early in the size of the graph, and for many applications, a method for determining 
the local cluster of a given source node will suffice, rather than a complete clus- 
tering of the entire graph. In this paper, we use approximate Fiedler vectors to 
determine potentials around a given source node, and then use the potentials to 
stochastically select an appropriate local cluster. 

In Section 2, we present the MCMC construction for uniform sampling, and 
in Section 3 discuss experiments performed with the method. Section 4 discusses 
local clustering with Fiedler vectors. Finally, Section 5 summarises the work and 
addresses directions for further research. 

2 An Efficient MCMC Method for Uniform Sampling 

Let G = (y, E) be a connected symmetric simple graph with n nodes and m edges. 
We denote the neighbourhood of node i ^Vhy T{i) = {j ^V \ {i, j) G E}, and 
the degree of i by deg(i) = |r(i)j. It is well known (and easy to verify) that the 
regular random walk on G, with transition probabilities 

Pij = < deg(z) (1) 

I 0, otherwise, 

satisfies the detailed balance conditions 

Vi, j € F : TTj ■ Pij = TTj ■ Pji (2) 

with respect to the distribution ttj = deg(i)/2m,, and hence this distribution, which 
we denote by ttrw, is stationary w.r.t. the walk l4l fT4ll . If G is non-bipartite, then 
ttrw is the unique equilibrium distribution. The chain Q mixes rapidly, but the 
probability of obtaining any given node f as a sample from it is proportional to the 
degree of i, and thus not uniform unless G is regular-. 



A straightforward approach to uniform sampling Q would be to augment the 
nodes of G with virtual self-loops so as to make them all have the same degree 
d = maxjgy deg(i). This method, however, requires knowing the target degree 
d ahead of time, and such global information is typically not available in many of 
the interesting applications. Moreover, this process may create some convergence 
anomalies in the case of highly nonuniform graphs G. Another alternative fOlETl 
would be to postprocess a sample obtained from walk Q in order to compensate for 
the bias in the stationary distribution ttrw- Such postprocessing, however, requires 
some a priori information on the number of burn-in steps needed before one can 
obtain a representative sample from ttrw, and the bum-in time again depends on 
the global structure of G. 

We take a complementary approach, by starting from a somewhat more slowly 
mixing random walk on G with a provably uniform stationary distribution, and 
then "accelerate" this walk by coupling it together with the chain (Q; however we 
only sample the combined process when it is in the "uniform sampling" mode. 

More precisely, we take as our starting point the following degree-balanced 
random walk on G, where the transition probabilities from node i are inversely 
proportional to the degree of the target node j: 



Pij 



ifjGr(i), 



deg(z) • deg(j) ' 



.'.. deg(i) • deg(j) ' 

jGr(j) 

^ 0, otherwise. 



It is simple to verify that the transition probabilities pij given by Q satisfy the 
detailed balance conditions with respect to the uniform distribution 7riD(i) = 1/n, 
and hence ttid is a stationary distribution for this chain. (Note that in this case the 
equilibrium distribution is unique for any G with more than two nodes, since any 
node i with non-leaf neighbours has a self-loop probability ofpa > 0.) 

However, this degree-balanced walk avoids visiting the high-degree nodes of 
a nonuniform graph, and so mixes relatively poorly in the graphs of most interest 
to us. A related problem is that the self-loop probabilities pu we. rather lai^ge for 
nodes with many high-degree neighbours. ^ 

In order to construct a sampling method that produces uniformly distributed 
samples but avoids the convergence problems of chain ^, we propose the foUow- 

' These problems could be alleviated somewhat by using the Metropolis-Hastings chain proposed 
in (51, with Pij = min{l/di, 1/dj} for j £ r(i), instead of our degree-balanced chain. However, 
as illustrated in Figure |5| below, both chains have qualitatively similar convergence behaviour, and 
the arithmetic of coupling to the regular random walk is somewhat simpler for the degree-balanced 
version. 




Sampling side p»j Pj^ p*'/ \pfi> Mixing side 



Pj'j' 



Figure 1 : A diagram of the mirror construction for two nodes i and j on the sam- 
pling side and their mirror nodes i' and / on the mixing side. 

ing construction (cf. Figure [B: for each node i e y we create a "mirror node" 
i' . The original nodes i G 1/ are called the "sampling side" and the mirror nodes 
i' € V are the "mixing side" of the augmented graph {\V\ = \V'\ = n). We 
continue to denote by deg(i) = deg(i') the degree of i in the original graph G, 
i.e., ignoring the added edges that connect the two sides. 

The transition probabilities on the sampling side follow those of the degree- 
balanced random walk; on the mixing side, a regular random walk is mimicked 
with minor modifications. The exact transition probabilities are defined as follows: 
let e be a parameter satisfying < e < pu for all i € ^ — further restrictions 
on e are discussed later in this section. Fix all the sampling-to-mixing transition 
probabilities pai to e. On the sampling side, subtract e from each pu and give all 
other transition probabilities the values they would have in the degree-balanced 
walk. On the mixing side, denote the probability of moving back to the sampling 
side from nodes i' by pj/j = e^. Let 5 be a parameter (to be determined later) 
such that S > e'- for all i' G V. Add to each node i' G V a. self-loop with 
transition probability pj/j/ = 6 — e[, and divide the remaining probability mass 
1 — 6 evenly among the neighbours of i' as in a regular random walk, i.e. assign 
p,/jv = (1 - 5)g^ for each f G T{i') \ {i}. 

We claim that the stationary distribution of such a combination walk is a weighted 
combination of the distributions ttid and ttrw, such that an a-fraction of the time 
the chain is in a state i £ V, and an (1 — a)-fraction of the time is spent within V': 

,. f a ■ TTiuix) = a ■ ^, ifx = ieV, 

t (1 - a) • TTRwlx) = (1 - a) • -^, if X = ^' G V . 

To verify the claim it suffices to check the detailed balance conditions ^ for the 
above construction. There are three cases to consider: (i) transitions within V, (ii) 
transitions within V , and (iii) transitions between V and V . 

The first two cases are essentially the same as those considered in the settings 
of the balanced and regular random walks, respectively: only some constant coeffi- 



cients (a, (1 — a), {1 — 6)) appear on both sides of the balance equations and cancel 
out. This leaves us with the third type: here the requirement is that any transitions 
between a node i G V and its mirror node i' € V satisfy 7rc(i) -pui = 7rc{i') -Pi'i, 
i.e. that 

a- ■ e = {1 - a)'^^^ ■ e[, for alH G F. (5) 

n 2m 

These equations can be satisfied by solving for the transition probabilities e[, once 

values for the parameters a and e have been chosen: 

, 2mae 2m a i 

'^ = n(l-a)deg(.) = V ' JT^' '^'^^'^~ ' ^^^ 

where ^ = fc is the average degree of nodes in G. As a probability, e^ must be at 
most one for all i ^ V. This yields an additional restriction on the parameter e: 

n 1 — (y 11 — a 

€ < • deg(i) = ^ ■ deg(z), for all i eV. (7) 

2m a k a 

Since deg(i) > 1 for all i G V,it suffices to choose e < k"^^-^. For a (nonuni- 
form) graph, averaging over a regular random walk will quickly give a positively 
biased estimate for k that can be used to bound e; note that many nonuniform 
networks have a modest average degree, despite the existence of a few extremely 
high-degree nodes. 

In an implementation of the above sampler one does not of course make explicit 
copies of the node sets, but rather uses a state flag that indicates which set of 
transition probabilities should be applied. All the transition probabilities are locally 
computable at each node i, if the parameters e and a are given, and the degrees of 
both the node i and its neighbours in r(i) are accessible. The dependency of e[ 
on the parameter a and the average degree k = 2ml n can be resolved by simply 

always setting e' = -— , which implicitly fixes the relationship 

deg(i) 

a 2m 1 

1 ^ a = ^ . (8) 



\ — a n fc + 1 

This implies by equation Q the condition e < 1, which is a natural restriction on 
e. By this choice of e^ we also have e^ < e for all i G V, and may thus set 6 = e, 
completing the definition of the transition probabilities on the mixing side. 

3 Sampling experiments 

In this section, we report on experiments using the above sampler construction on 
both artificial networks with known properties (so called "pseudofractal graphs" 




Figure 2: The DGM pseudo-fractal graphs Gt (adapted from (SI). Newly added 
nodes are drawn white. 

of Dorogovtsev, Goltsev, and Mendes [8'|), and scientific collaboration networks of 
n = 503 and n = 5,909 mathematicians and computer scientists, with total number 
of coauthorships m = 828 and m = 13,510 respectively (subgraphs of the network 
constructed in f24|). 

In the deterministic scale-free network model of Dorogovtsev, Goltsev, and 
Mendes [8] (based on |2l), the initial graph G_i = {V-i,E^i) consists of two 
nodes v and w and an edge {v, w). At each generation t > of the generative 
process, per each edge (u, v) € Et^i, a new node lu is added together with edges 
(n, w) and {v, w). (See Figure|2lfor an illustration of the first five generations.) The 
resulting graphs Gt have an almost constant average degree of fet = 4(1 + 3"*), 
yet a power-law distribution of node degrees according to nt{d) = 3*~*'^(i~^°S2 3 

As a first indication of the behaviour of various sampling strategies. Figures |3l 
and 0] present plots of the percentage of graph covered versus length of the walk, 
for DGM networks G5, G7 and Gg. Note that the combination walks sample fewer 
nodes during a walk of a given length than the others, as it does not record samples 
during the mixing phase. The tendency of the degree-balanced method to unwanted 
locality is quite evident in Figure |3] 

In another set of experiments, we estimated the rate of convergence of the 
above discussed random walks to their respective stationary distributions. If iTt is 
the distribution of a random walk after t steps, and vr is its stationary distribution, 
the total variation distance between the two is defined as l4lfT4l: 



A(i) =max|7ri(5) -vr(5)| = i J^ |^,(i) - vr(i)|. (9) 

We estimate this quantity by running k independent instantiations of a given ran- 
dom walk starting from the same start node, and looking at the state distributions at 
time t of the instantiations. For definiteness, let us consider the case where the sta- 
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Figure 3: The coverage achieved by the regular (top row) and the degree-balanced 
walks at each step for DGM graphs of generations 5, 7, and 9. In each plot, 30 
independent walks are shown. 

tionary distribution is uniform, with 7r(i) = 1/n for all i £ V. Denoting by ft{i) 
the number of instantiated walks that are visiting node i at time t, a conservative 
estimate of the total variation distance at time t can then be computed as |i5J : 



Aest(t) 



1 



E 



mill 



Mi) 1 



k 



n 



(10) 



Figure |5l shows the time evolution of these estimates for the regular, balanced, 
combination random walks and the Metropolis-Hastings walk of [3J in DGM graphs 
of generations five and seven, and for the two collaboration graphs of n = 503 and 
n = 5,909 scientists. The stationary distribution for the regular walk is taken to be 
the degree-proportional distribution ttrw, and for the three other walks the uniform 
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Figure 4: The coverage achieved by the combination random walk on a ninth gen- 
eration DGM graph for different values of e. 
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Figure 5: Values of Aest(i) for the regular, degree-balanced, combination, and 
Metropolis-Hastings ("minimum-balanced") random walks over a set of 15,000 
independent walks in two DGM graphs and two collaboration graphs, all starting 
from a fixed node, initially chosen at random. Note logarithmic scale on the time 
axis. 



distribution vtid. For the combination walk, only those instantiated walks that are 
on the sampling side at any given time step are included in computing the corre- 
sponding estimate. The plots illustrate quite graphically (particularly in the case 
of the heavy-tailed DGM graphs) that the convergence behaviour of the combi- 
nation walk is qualitatively similar- to that of the regular walk, whereas both the 
pure balanced walk and the Metropolis-Hastings walk converge noticeably more 
slowly.^ 

4 Local clustering by approximate Fiedler vectors 

Another key task in the analysis of natural networks is finding clusters of densely 
interconnected nodes. Most of the existing literature on this topic (see fT9l for a 
survey) considers the task of finding an ideal complete clustering of a given graph. 
This is, however, often unnecessary and in any case infeasible in the case of really 
large networks such as the WWW. (The fastest complete algorithms can currently 
deal with networks containing up to maybe a few millions of nodes ll5lfT8lfT9l .) 
In many cases it would be sufficient to know the relevant cluster of a given source 
node, or maybe a group of nodes. Some recent papers, such as ll23l l25l address 
also this more limited goal. 

In 1231 l24l . a parameter-free local clustering quality measure is optimised us- 
ing simulated annealing: the computational effort needed to obtain the cluster of a 
given source node is quite modest (and, most importantly, independent of the total 
size of the network), and the results seem to be quite robust w.r.t. variations in the 
annealing process. In l25l . the clustering task is formulated as a problem of de- 
termining voltage levels in an electrical circuit with unit resistances coiTcsponding 
to the edges of the original network. The source node is fixed at a high voltage 
value and a randomly selected target node at low voltage; an approximate solution 
to the Kirchhoff equations is computed by an iteration scheme, and the eventual 
cluster of the source node is deemed to consist of those nodes whose voltages are 
"close" to the high value. The possibility that the tai^get node is accidentally se- 
lected from within the natural cluster of the source node is decreased by repeating 
the experiment some small number of times and determining cluster membership 
by majority vote. 

This electrical circuit analogue appears to have been first suggested in fT9l . 
where however the aim is to compute a complete clustering of a given network by 
considering all possible source-target pairs, and for each pair solving the Kirch- 
hoff equations exactly by explicitly inverting the coiTesponding Laplacian matrix. 

"There is some residual small-sample bias in the estimates; we have computed the size of this 
effect and will indicate these calculations in the extended version of this paper. 



(We note that since solutions of the Kirchhoff equations can be decomposed in 
terms of the eigenvectors of the circuit graph Laplacian, this method is a variant 
of the much-studied spectral partitioning techniques (9l[l0|[ni[l2l[l6l|20l|22l. A 
distributed algorithm for spectral analysis, possibly suited for large networks, is 
proposed in (TT\ . A fundamental reference is |51l.) 

We continue the analogue of representing cluster membership values as phys- 
ical potentials, but eliminate the unnatural choice of random "target" nodes by 
basing our model on diffusion in an unbounded medium rather than an electrical 
closed-circuit model. Thus, we fix the source node i at a constant potential level, 
which we choose to be zero, and find an eigenvector u corresponding to the small- 
est eigenvalue ui of the respective Dirichlet matrix, i.e. the Laplacian matrix of 
the network with row and column i removed Ii6 7J- This eigenvector n, called the 
(Dirichlet-)Fiedler vector of the graph, will now (hopefully) assign potential values 
u{j) close to for nodes j that are within a densely interconnected neighbourhood 
of the source node i, and larger values for nodes that have sparser connections 
to the source. (The method obviously generalises to starting from a larger set of 
source nodes, if desired.) 

Since we wish to develop a local algorithm, and not deal with the full adjacency 
matrix of the network, we approach the computation of the Fiedler vector u via 
minimising the Rayleigh quotient |6l|71: 

(y\ = mf^ — = — -—, , (11) 



u 



where the infimum is computed over vectors u satisfying the Dirichlet boundary 
condition of having u{i) = for the source node(s). (The notation j ~ A; is an 
abbreviation for (j, k) G E.) Furthermore, since we are free to normalise our 
eventual Fiedler vector to any length we wish, we can constrain the minimisation 
to vectors u that satisfy, say, ||ii||2 = n = \V\. Thus, the task becomes one of 
finding a vector u that satisfies: 

u = argmin< NJ(m(j) — n(/c))^ | u{i) = 0, \\u\\2 = n>. (12) 

We can solve this task approximately by reformulating the requirement that ||u||2 = 
n as a "soft constraint" with weight c > 0, and minimising the objective function 

fin) = ^E(^(j)-^W)' + f (^-E^(J')') (13) 

jr^k j 
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Figure 6: Three local Fiedler clusters in a caveman graph of 138 nodes. 

by gradient descent. Since the partial derivatives of / have the simple form 

df 



du{j) 



^u{k) + (deg(j) - c) • u{j), 



(14) 



the descent step can be computed locally at each node, based on information about 
the M-estimates at the node itself and its neighbours: 



ut+iij) 



MJ) + -^ • ( E ^(^) - (deg(j) - c) • uij)) , (15) 



fc~j 



where 5 > is a parameter determining the speed of the descent. Assuming that 
the natural cluster of node i is small compared to the size of the full network, 
the normalisation \\u\\2 = n entails that most nodes j in the network will have 
u{j) ^ 1. Thus the descent iterations (fTSl can be started from an initial vector uq 
that has UQ{i) = for the source node i and tto(^) = 1 for all k ^ i. The estimates 
need then to be updated at time i > only for those nodes j that have neighbours 
/c ~ j such that ut-i{k) < 1. 

Figure |6lrepresents the results of such approximate Fiedler vector calculations 
in the case of a slightly randomised "caveman" network of 138 nodes, starting from 
three different source nodes. For visual effect, the nodes are colour-coded so that 
dark colours correspond to small approximated Fiedler potential values, with the 
source node in each case coloured black. The parameter values used in this case 
were c = 0.1, <5 = 0.05. 

Visually, the clusters in e.g. Figure|6llook reasonable; in practice, however, we 
need to determine the cluster boundaries automatically. One possibility would be 
to threshold the potentials as in E51l . but we prefer not to introduce any additional 
instance-specific parameters to the algorithm. A natural alternative is to find a set 
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of nodes S that contains the source node i and minimises some weighted Cheeger 
ratio (51 p. 35]: 

""^ ' ~ ~T V wii k) ' ^ ^ 

where w{j, k) is an appropriate nonnegative edge weight function. In our experi- 
ments, edge weights determined as w{j,k) = {\u{j) — u{k)\)^^ seem to lead to 
natural clusters in different types of networks, and are also intuitively appealing. 
In Figure |51 we have indicated the nodes selected by this heuristic as belonging to 
each cluster by circles with thick boundaries. The minimisation of the cluster cost 
function (IT6t was here performed by a local simulated annealing process similar to 
the one used in l23ll24l. 



5 Conclusions and Further Work 

In this paper we presented two methods to help analyse properties of large nonuni- 
form graphs: a uniform sampling construction and a local method for clustering 
based on approximate Fiedler vectors. According to our experiments, both ap- 
proaches are well-behaving and conform to the intuition that arises from their ana- 
lytical properties. 

As future work, we will look into more general constructions for rapidly mixing 
uniform MCMC samplers; one direction might be to combine the regular random 
walk with alternative slow uniform samplers, such as those of |3 1. Accuracy of the 
estimates of natural network network characteristics based on our pseudo-uniform 
samples should also be assessed. Both the sampling and the clustering algorithms 
should also be extended to work on directed graphs, in order to deal with interesting 
natural networks such as the WWW. 
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